CodeKernel: A Graph Kernel based Approach to the Selection of API Usage Examples (2019)

Therefore, these source code representations could result in imprecise code similarity measurement. The produced code examples are often inaccurate and difficult for developers to reuse in programming practice. Nguyen et al. proposed GrouMiner (Graph-based mining of multiple object usage patterns (2009))

これはgraph-basedのmining Graph-based pattern-oriented, context-sensitive source code completion

しかし GrouMIner は redundant な結果を返しがち

How can I use this method? (2015) で提案されたMUSEは code clone detection 技術を使って、code snippet の clustering を行う。いいんだけど、MUSEのcode clone detecion技術はあくまでtext-baseのclone detectionなのでinaccurateなclusteringが多い。

CodeKernel を提案、以下のような characteristics を持つよ

Graph-based pattern-oriented, context-sensitive source code completion

code を method call sequence や

CodeKernel clusters the similar graphs through graph kerne

MAPO は frequent call sequence を mining し、 MUSE は code clone detection によって似たような結果を clustering していたのに対し

こういう感じのワークフロー

CodeKernel ﬁrst builds object usage graphs for each function.

It then clusters the graphs through graph embedding

Finally, CodeKernel selects the representative graph of each cluster using ranking metrics.

hr.icon

3 Background

graph kernel についての説明

Shortest-path kernels on graphs (2005)

Protein function prediction via graph kernels (2005)

Contextual weisfeiler-lehman graph kernel for malware detection

hr.icon

4. Approach

https://scrapbox.io/files/608e18c452131a001c72c3a7.png

The offline processing is responsible for selectingcode examples. It gathers relevant code snippets for eachAPI and selects API usage examples using CodeKernel

At runtime, for a given API query (such as F ileReader.read), it identiﬁes and presents relevant code examples to users.

a. A. Graph Representation for Source Code

まずはソースコードを object usage graph に変換する

An object usage graph is a directed acyclic graph deﬁned as G = (V, E) where

V stands for a set of nodes (controls, actions, and data) and

E ⊆ V × V denotes a set of edges representing call sequences or data dependencies

Each node is associated with a label representing a class/method name or a control unit

node の種類

The action nodes such as StringBuffer.new and BufferedReader.readLine stand for method calls or ﬁeld accesses.

The data nodes such as StringBuffer and BufferedReader represent objects of a class.

edge の種類

There are two types of edges, sequential edges and data edges.

Sequential edges connect nodes with strict orders. For example, BufferedReader.new must be executed before BufferedReader.readLine.

control flow ってこと?

Data edges connect a data node with action nodes if action nodes use objects or parameters of the data node. For example, BufferedReader.readLine and BufferedReader.close are connected with data node BufferedReader since they both use the object br deﬁned in the data node.

data flow ってこと?

data node にはそのクラスの型が書かれている？

https://scrapbox.io/files/608e1dd3d9cb81001c535d43.png

node

四角が action node

丸角が data node

edge

波線が data edge

-> が sequential edge

作ったグラフは隣接行列(adjacency matrix)と label vectorに変換

n * n 行列 (n は node の数)

各entryは

0: edge なし

1: sequential edge

2: data edge

Graph-based mining of multiple object usage patterns の GrouMiner を使ってプログラムをグラフに変換する

https://scrapbox.io/files/608e72562a35610022b75d63.png

B. Graph Embedding

Accurate and efficient structural characteristic feature extraction for clone detection などでは graph から feature vector を計算していたが、CodeKernel ではすべてのグラフのすべての2ペアの組み合わせに対して

we employ the graph kernel to embed the original graphs into a high-dimensional, continuous space in which their inner products can be accurately calculated while being computationally cheap

非負 K_N*N 行列に変換 (Nはグラフの数)

(GraphKernel を用いた類似度を計算)

詳細は Shortest-path kernels on graphs (2005)

Borgwardt’s code in Matlab を利用

C. Graph Clustering

Document clustering using locality preserving indexing (2005) のように Spectral Clustering| Unsupervised Learning | Eigenvalues | Absolutdata を使ってグラフのクラスタリングを行う

the most important reason we use the spectral clustering is that it ﬁts our data well

何故?

In our problem, data in the continuous space is not vectorial. Therefore, algorithms that require vectorial inputs such as K-means, Gaussian mixture model and EM are not applicable. Spectral clustering, on the other hand, takes as input similarity pairs instead of vectors.

D. Example Selection

クラスタごとに代表グラフを選択

選択したグラフから、元のコードを復元

1) Rank metrics

代表グラフを選択するためのメトリクスとして以下のものを定めた

Centrality

We ﬁrst want the representative graph to be as generic as possible in the cluster.

できるだけ、同一クラスタ内のすべてのグラフと似てるグラフを代表としたい

Inspired by A simple and fast algorithm for k-medoids clustering (2009), Centarality メトリックを定義した, which measures the average distance from a graph to other graphs in the cluster.

クラスタ内の各グラフ gi について

https://scrapbox.io/files/608f885ee4117f001d8ee0d0.png

Specificity

Centrality だけでは、選択されるグラフは大きなグラフになりがち(大きいグラフは小さいやつらと似たようなグラフ構造を持っているはずだから)

大きいグラフには specific edge (そのグラフにしか現れないようなコード片)がよく現れる。

Specificity を penalize するためのメトリックを定義

https://scrapbox.io/files/608f89db9b1823001cabbbe6.png

E_gi はグラフ gi の edge の数

https://scrapbox.io/files/608f8a4b00e276001fac34af.png (inversed document frequency) rareness of the edge

|C| はクラスタに含まれるグラフの数

N_e は number of times the edge e appears in the cluster C

2) Representative Graph Selection

最終的なスコアは score = centrality − γ · speciﬁcity γ は specificity をどのくらい重視するかのパラメータ、この実験では0.2とする

検索結果にマッチしたクラスター

どういうふうにマッチさせるのか?????????????

検索がクエリのAPIが edge に含まれていたらマッチする?

最終的なランキングは、グラフの属しているクラスタのサイズ大きい順

hr.icon

5. Empirical Evaluation

RQ1: How accurate are the API usage examples selected by CodeKernel?

ここでいう accurate って何?

RQ2: How useful is CodeKernel for selecting API usage examples?

RQ3: Does graph kernel help improve the graph clustering performance ?

A. Accuracy of Selected Examples(RQ1)

1) Accuracy of Code Clustering